Twitter sentiment analysis using NLP

The purpose of this project is to examine, using NLP techniques, the relationship between the situation in Afghanistan and tweets concerning Biden and Trump in August 2021.

Versions:

  1. pandas == 1.2.4
  2. numpy == 1.19.5
  3. re == 2.2.1
  4. nltk == 3.6.2
  5. matplotlib == 3.3.4
  6. textblob == 0.15.3
  7. sklearn == 0.24.2
  8. wordcloud == 1.8.1
  9. google_trans_new == 1.1.9
  10. snscrape == 0.3.4
  11. langdetect == 1.0.9
  12. string
  13. datetime
  14. pandas_profiling == 3.0.0

1. Scraping tweets

Because I started this project on 27.08.2021 and wanted to examine the period from 01.08.2021 onward, I needed a library that can scrape historical tweets. I decided to use snscrape. I noticed that the number of tweets scraped this way is significantly smaller than the number of tweets collected on an ongoing basis with another library. However, after checking several options, I concluded that my choices were severely restricted. So I gave snscrape a chance while being aware of its limitations: possible bias and variance in the collected sample.
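
A minimal sketch of how historical tweets for a date range could be requested with snscrape. The keyword, the `limit` value and the dictionary keys are assumptions made for illustration, not the project's actual settings. The `tweet.content` attribute is the name used in snscrape 0.3.x; to my understanding, later releases renamed it.

```python
from datetime import date


def build_query(keyword, since, until):
    """Build a Twitter search query restricted to a date range."""
    # `since:` and `until:` are standard Twitter search operators.
    return f"{keyword} since:{since.isoformat()} until:{until.isoformat()}"


def scrape_tweets(query, limit=1000):
    """Collect up to `limit` tweets for `query` (needs snscrape and network access)."""
    # Imported lazily so that build_query works without snscrape installed.
    import snscrape.modules.twitter as sntwitter

    rows = []
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= limit:
            break
        rows.append({"date": tweet.date, "id": tweet.id, "content": tweet.content})
    return rows


query = build_query("biden", date(2021, 8, 1), date(2021, 8, 27))
```

Calling `scrape_tweets(query)` would then return a list of dictionaries that can be passed to `pandas.DataFrame`.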

2. Cleaning data

The next step was cleaning the data. Unexpectedly, the Biden dataset included erroneously scraped rows (the date column contained values that were not dates), and some rows in both the Biden and Trump datasets caused problems later, so those rows were deleted from the initial datasets. The remaining rows were cleaned by removing emojis, http(s) addresses and punctuation marks. The texts were converted to lower case, and joined hashtags (#KabulAirport) were split into separate words (kabul airport).
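
The cleaning steps described above can be sketched as follows. This is an illustrative version, not the project's exact code: the emoji pattern covers only the most common Unicode blocks, only ASCII punctuation is removed, and hashtags are split at lower-to-upper case boundaries.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
# Zero-width match at each lower-to-upper case boundary: "KabulAirport" -> "Kabul Airport".
CAMEL_RE = re.compile(r"(?<=[a-z])(?=[A-Z])")
# Common emoji blocks, flags, the variation selector and the zero-width joiner (not exhaustive).
EMOJI_RE = re.compile("[\U0001F1E6-\U0001F1FF\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Remove URLs, emojis and punctuation, split joined hashtags, lower-case the text."""
    text = URL_RE.sub(" ", text)  # URLs first, while their punctuation is still intact
    text = HASHTAG_RE.sub(lambda m: CAMEL_RE.sub(" ", m.group(1)), text)
    text = EMOJI_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.lower().split())  # lower-case and collapse whitespace


print(clean_tweet("Chaos at #KabulAirport!!! 😢 https://t.co/abc123"))
```

Letters with diacritics are kept, which matters for the Polish, Spanish, German and French tweets.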

The first two cleaned rows for Biden and Trump datasets:

Then the language of each row was detected.
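
A sketch of per-row language detection, assuming langdetect is the detector (it is listed in the versions above). The `detector` parameter and the `"unknown"` fallback label are my own additions so that rows the detector cannot handle, such as empty texts, do not stop the run.

```python
def detect_languages(texts, detector=None):
    """Return one language code per text, or "unknown" when detection fails."""
    if detector is None:
        # Imported lazily; requires the langdetect package.
        from langdetect import DetectorFactory, detect

        DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
        detector = detect

    codes = []
    for text in texts:
        try:
            codes.append(detector(text))
        except Exception:
            # langdetect raises an exception for texts without usable features.
            codes.append("unknown")
    return codes
```

With langdetect installed, `detect_languages(["Hello world", "Dzień dobry"])` returns one ISO 639-1 code per text. Short tweets are often misclassified, so the result is best treated as approximate.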

3. Joining data since 27.08.2021

Because the data were first scraped before the end of the month, tweets from the remaining days (from 27.08.2021 until 01.09.2021) were appended to the original datasets.

4. Sentiment analysis - polarity & subjectivity

The TextBlob library was used to calculate polarity and subjectivity. However, the library is designed for English texts only. As a consequence, the non-English tweets had to be translated first.
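
The translate-then-score flow can be sketched as below. `TextBlob(text).sentiment` returns a polarity in [-1, 1] and a subjectivity in [0, 1]. The `translate` argument is a placeholder for whatever translation call is used (the project lists google_trans_new); its exact API is deliberately not assumed here.

```python
def textblob_sentiment(text):
    """Return (polarity, subjectivity) for an English text using TextBlob."""
    from textblob import TextBlob  # imported lazily; requires the textblob package

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


def score_tweet(text, lang, translate, sentiment=textblob_sentiment):
    """Score a tweet, translating it to English first when `lang` is not "en"."""
    english = text if lang == "en" else translate(text)
    return sentiment(english)
```

Keeping translation as a separate step makes it possible to cache translated texts, which helps when the translation service is slow or rate-limited.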

5. Tweets languages

Because the datasets contain a very large number of tweets, I decided to translate only a few of the most popular languages, plus Polish as my native language. Among the most common languages, English took 1st place, Italian 2nd, and Spanish or German 3rd. Taking into account the mistakes made by the language detector and the similarity between Italian and Spanish, I chose English (by far the largest share), Spanish, German, French (European languages that are similar but noticeably different) and Polish (native language and curiosity).

The analysis for each language below is based on polarity and subjectivity, preceded by translation where needed. For English and Polish, the analysis was extended with word clouds.

5.1. English

Biden profile

Trump profile

To create a word cloud, it is necessary to determine the most popular words and their frequencies. To make the clouds more meaningful, it is also good practice to remove common words (called stop words).

I was not fully satisfied with the quality of the default stop words, so based on biden_dict and trump_dict, I added my own. The collected words, sorted in descending order by number of occurrences, are printed below.
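
A small sketch of counting word frequencies with an extended stop-word list. Both word sets and the sample tweets are made up for illustration; in practice the base set would come from a library list and the custom set from inspecting the counts.

```python
from collections import Counter

# Illustrative only: a library-provided stop-word list would normally go here.
BASE_STOP_WORDS = {"the", "a", "is", "to", "of", "and", "in"}
# Hypothetical project-specific additions found by inspecting the counts.
CUSTOM_STOP_WORDS = {"amp", "rt"}


def word_frequencies(tweets, stop_words):
    """Count words across cleaned, lower-cased tweets, skipping stop words."""
    counts = Counter()
    for tweet in tweets:
        counts.update(word for word in tweet.split() if word not in stop_words)
    return counts


tweets = ["biden is in kabul", "rt the taliban and biden", "kabul airport amp biden"]
freqs = word_frequencies(tweets, BASE_STOP_WORDS | CUSTOM_STOP_WORDS)
for word, count in freqs.most_common(2):
    print(word, count)
```

A word-to-count mapping like `freqs` is the kind of input that `WordCloud.generate_from_frequencies` accepts.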

Among the most popular words, in addition to "biden" or "trump", there were also terms related to Afghanistan, such as "taliban", "afghan", "afghanistan", "kabul" or "airport". Some words were not connected with the situation in Afghanistan at all (e.g. "border", "covid", "vaccine"), and some were ambiguous: "help", "disaster", "left", "failure". In both word clouds, the words "biden" and "trump" occurred often.

5.2. Spanish

The next three analyses (Spanish, German and French) followed the same steps: translation, polarity and subjectivity, and the presidents' profiles. I therefore do not discuss them in detail.

Biden profile

Trump profile

5.3. German

Biden profile

Trump profile

5.4. French

Biden profile

Trump profile

5.5. Polish

For the Polish language, the same steps were taken as for English.

Biden profile

Trump profile

As in English, tweets in Polish covered similar topics (including Afghanistan). Additionally, there were words related to the situation in Poland, e.g. "duda", "lgbt", "onet".

6. Polarity & Subjectivity - plots

The final step of my project was data visualization!
The datasets for this part consisted of tweets in the languages analyzed above, i.e. English, Spanish, German, French and Polish.

First, I checked whether the number of Biden's tweets differs clearly between days. For all languages, two dominant peaks can be observed. The first one (the higher bar, around 15.08.2021) concerns the situation in Afghanistan: the withdrawal of troops and the Taliban's seizure of power. The second one (the lower bar, around 26.08.2021) could concern Biden's speech about the Kabul airport attack, which killed a dozen American troops, Biden's confirmation that troops would be evacuated by 31.08.2021, or other issues addressed at the 25.08.2021 conference.
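
Counting tweets per day, the quantity behind such a bar chart, can be sketched in plain Python. The timestamps below are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def tweets_per_day(timestamps):
    """Count tweets per calendar day from an iterable of datetimes."""
    return Counter(ts.date() for ts in timestamps)


timestamps = [
    datetime(2021, 8, 15, 9, 0),
    datetime(2021, 8, 15, 21, 30),
    datetime(2021, 8, 15, 23, 59),
    datetime(2021, 8, 26, 12, 0),
]
daily = tweets_per_day(timestamps)
peak_day = max(daily, key=daily.get)  # the day with the most tweets
print(peak_day, daily[peak_day])
```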

Similar observations (though not for all languages) can be made for Trump's dataset.

Looking at polarity and subjectivity in Biden's and Trump's datasets, neutral tweets (polarity 0) predominate for both presidents. Apart from the spike at 0.0, the number of tweets decreases smoothly as polarity moves away from zero. For subjectivity, the largest number of tweets falls at 0.0 (no subjectivity). The second-largest peak is at 0.5, beyond which the number of tweets flattens.

Finally, the data were grouped into time windows, and the mean polarity and subjectivity were calculated for each window. Different window lengths were set for different languages:

  1. English - 6 h
  2. Spanish, German, French - 12 h
  3. Polish - 24 h
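
The windowed averaging can be sketched in plain Python as below. It assumes the window length divides 24 and that windows are aligned to midnight; the sample rows are made up. With pandas, time-based resampling would be a more typical way to express the same grouping.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def window_start(ts, hours):
    """Floor a timestamp to the start of its window (windows aligned to midnight)."""
    # Assumes `hours` divides 24, e.g. 6, 12 or 24.
    return ts.replace(hour=ts.hour - ts.hour % hours, minute=0, second=0, microsecond=0)


def mean_by_window(rows, hours):
    """rows: iterable of (timestamp, polarity, subjectivity) tuples.

    Returns {window_start: (mean_polarity, mean_subjectivity)} in time order.
    """
    windows = defaultdict(list)
    for ts, polarity, subjectivity in rows:
        windows[window_start(ts, hours)].append((polarity, subjectivity))
    return {
        start: (mean(p for p, _ in values), mean(s for _, s in values))
        for start, values in sorted(windows.items())
    }


rows = [
    (datetime(2021, 8, 15, 1, 30), 0.5, 0.5),
    (datetime(2021, 8, 15, 5, 59), 0.0, 1.0),
    (datetime(2021, 8, 15, 6, 0), -0.5, 0.25),
]
six_hourly = mean_by_window(rows, 6)
daily_means = mean_by_window(rows, 24)
```

Longer windows smooth the series more, which is one reason to use them for languages with fewer tweets per hour.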

Plots of polarity and subjectivity over time were then created.
As can be seen, the English tweets show low variance in comparison to the other languages. They fluctuate around 0.0 - 0.1 for polarity and around 0.3 - 0.4 for subjectivity. The visible difference could be caused by the far higher share of English tweets in the datasets, by changes in meaning introduced by translation, and by the omission of high-polarity and high-subjectivity tweets from the initial datasets.

Comparing both datasets on one graph for each language gave a more insightful view. For nearly all languages, a substantial decrease in polarity for Biden can be seen after 11.08.2021 (the withdrawal of US forces from Afghanistan). This relationship is clearly visible in the English dataset.